Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


mapping because the aligner should also be able to detect the splice junctions. Alignment is usually complicated by mismatches, which may be due to base-calling errors or to genetic variation in the individual genome. In general, aligners therefore need a strategy that supports both exact search and inexact search, so that reads with mismatches can still be placed. Almost all read aligners work in two major steps: indexing the sequence of the reference genome, and then finding the most likely locations of the reads in the reference genome.
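
The difference between exact and inexact search can be illustrated with a small sketch. It is a naive scan that counts substitutions only (no insertions or deletions) and uses no index, so it is not how BWA, Bowtie2, or STAR work internally; the toy reference, the reads, and the function names are made up for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_read(reference: str, read: str, max_mismatches: int = 0) -> list:
    """Return (0-based position, mismatches) for every placement of the read
    with at most max_mismatches substitutions. Naive scan, no indels."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = hamming(window, read)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "ACGTACGTTAGCCGATAGGCTA"  # toy reference, made up for illustration
read_exact = "TAGCCGAT"               # occurs verbatim in the reference
read_variant = "TAGCTGAT"             # same read with one substituted base

print(find_read(reference, read_exact))                        # exact search
print(find_read(reference, read_variant))                      # exact search finds nothing
print(find_read(reference, read_variant, max_mismatches=1))    # inexact search
```

With zero mismatches allowed, the read carrying a substitution is not placed at all; allowing one mismatch recovers its position. The cost of this scan grows with the product of reference length and read length, which is why practical aligners index the reference first.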

The FASTA sequence of the reference genome of an organism can be downloaded from genome databases such as NCBI Genome and the UCSC database. The FASTA file is first indexed with the “samtools faidx” command, which allows fast random access to its sequences; each aligner then builds its own index of the reference for read lookup. The data structures commonly used for genome indexing include the Burrows-Wheeler transform (BWT), the FM-index, suffix arrays, and hash tables, chosen for their memory efficiency and their ability to hold a whole genome sequence. A variety of aligners are available that use different indexing and lookup algorithms.
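
To make the indexing idea concrete, the following is a minimal sketch of a BWT-based FM-index with backward search for exact matching. It assumes a tiny reference and stores the suffix array and occurrence table uncompressed, whereas production aligners sample and compress these structures; the function names and the toy sequence are illustrative only.

```python
def build_fm_index(reference: str):
    """Build a toy FM-index: suffix array, BWT, C table, and Occ table.

    Everything is stored uncompressed; real aligners sample and compress these.
    """
    text = reference + "$"  # '$' sentinel sorts before A, C, G, T
    # Suffix array by direct sorting: fine for a toy string, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character that precedes each sorted suffix (index -1 wraps to '$').
    bwt = "".join(text[i - 1] for i in sa)

    alphabet = sorted(set(text))
    # c_table[ch]: number of characters in the text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, bwt, c_table, occ


def backward_search(pattern: str, sa, c_table, occ) -> list:
    """Return sorted 0-based positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open interval of suffixes that match so far
    for ch in reversed(pattern):  # extend the match one character to the left
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


reference = "ACGTACGTTAGCCGATAGGCTA"  # toy reference, made up for illustration
sa, bwt, c_table, occ = build_fm_index(reference)
print(bwt)
print(backward_search("TAG", sa, c_table, occ))  # [8, 15]
print(backward_search("GGG", sa, c_table, occ))  # []
```

Each step of the backward search narrows an interval of the suffix array using only two table lookups, so the lookup time depends on the read length rather than on the genome length. That property is the main reason the index is built once and reused for every read.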

We discussed only BWA, Bowtie2, and STAR, but those are only examples. Before using an aligner, you may need to know its memory requirements and whether it can handle short reads, long reads, or both. If you have RNA-Seq reads, you may also need to know whether the aligner can detect splice junctions. Both BWA and Bowtie2 are general-purpose aligners that can be used for all kinds of reads, and they run well on a desktop computer with 32 GB of RAM or more. STAR is better suited to RNA-Seq reads; it can also run on a desktop computer, but it requires much more memory for both indexing and mapping.

Almost all aligners produce SAM/BAM files, which store read mapping information. A SAM/BAM file consists of a header section and an alignment section. The alignment section has eleven mandatory columns; each row holds the mapping information of one read. The columns are read name, FLAG, reference sequence name (e.g., chromosome name or accession), position in the reference sequence (coordinate), mapping quality, CIGAR string, reference name of the mate, position of the mate, template length, read sequence, and Phred base quality. The FLAG field stores standard bit codes that describe the alignment (e.g., unmapped reads, duplicate reads, and chimeric alignments). The CIGAR string describes how the read aligns to the reference in terms of operations such as matches, mismatches, insertions, and deletions.
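
As an illustration of the alignment section, the sketch below splits one SAM line into its mandatory fields, decodes the FLAG bits, and parses the CIGAR string. The bit values and CIGAR operation letters follow the SAM specification; the record itself, the short flag names, and the helper names are made up for this example, and in practice a library such as pysam would normally be used instead.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # Phred base qualities (ASCII-encoded)


# FLAG bits as defined in the SAM specification; the names are informal labels.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory columns of one alignment line (optional tags ignored)."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def decode_flag(flag: int) -> list:
    """List the labels of all bits that are set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar: str) -> int:
    """Bases of the reference covered: only M, D, N, = and X consume reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


# A made-up record for illustration only.
line = "read1\t99\tchr1\t100\t60\t5M1I4M\t=\t300\t250\tACGTACGTAC\tIIIIIIIIII"
rec = parse_sam_line(line)
print(decode_flag(rec.flag))      # ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
print(parse_cigar(rec.cigar))     # [(5, 'M'), (1, 'I'), (4, 'M')]
print(reference_span(rec.cigar))  # 9
```

Note that the CIGAR operation M covers both matching and mismatching bases; the operations = and X are used when an aligner chooses to distinguish them.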

SAM/BAM files can be manipulated with programs such as Samtools and Picard. Common operations include format conversion, indexing, sorting, viewing, filtering, and computing statistics.
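
The following sketch shows, on plain SAM text, what filtering and coordinate sorting amount to. It is roughly the kind of operation that "samtools view -q 30 -F 4" followed by "samtools sort" performs, but it is a simplification: it orders reference names alphabetically rather than by the order given in the header, and it does not update the header. The records and the function name are made up for illustration.

```python
def filter_and_sort(sam_lines, min_mapq: int = 30) -> list:
    """Keep header lines, drop unmapped or low-MAPQ reads, sort the rest by position."""
    header, records = [], []
    for raw in sam_lines:
        line = raw.rstrip("\n")
        if line.startswith("@"):       # header lines start with '@'
            header.append(line)
            continue
        fields = line.split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4:                 # 0x4 is the "read unmapped" bit
            continue
        if mapq < min_mapq:            # discard low-confidence placements
            continue
        records.append(fields)
    # Simplification: alphabetical reference name, then 1-based position.
    records.sort(key=lambda f: (f[2], int(f[3])))
    return header + ["\t".join(f) for f in records]


# Made-up records for illustration only.
sam_lines = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr2\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",       # unmapped, removed
    "r3\t0\tchr1\t200\t10\t4M\t*\t0\t0\tACGT\tIIII",  # MAPQ 10, removed
    "r4\t16\tchr1\t120\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "r5\t0\tchr1\t30\t37\t4M\t*\t0\t0\tACGT\tIIII",
]

for out_line in filter_and_sort(sam_lines):
    print(out_line)
```

For real data, the dedicated tools remain the better choice because they work on compressed BAM files and handle headers, indexes, and large inputs correctly.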

SAM/BAM files are used in downstream data analysis such as reference-guided genome assembly, variant discovery, gene expression (RNA-Seq data analysis), epigenetics (ChIP-Seq data analysis), and metagenomics, as we will discuss in the coming chapters.
